arXiv:l504.03655v4 [cs.LG] 10 Jan 2016 


Scale Up Nonlinear Component Analysis with 
Doubly Stochastic Gradients 
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Abstract 

Nonlinear component analysis such as kernel Principle Component Analysis (KPCA) and kernel 
Canonical Correlation Analysis (KCCA) are widely used in machine learning, statistics and data analysis, 
but they can not scale up to big datasets. Recent attempts have employed random feature approximations 
to convert the problem to the primal form for linear computational complexity. However, to obtain high 
quality solutions, the number of random features should be the same order of magnitude as the number 
of data points, making such approach not directly applicable to the regime with millions of data points. 

We propose a simple, computationally efficient, and memory friendly algorithm based on the “doubly 
stochastic gradients” to scale up a range of kernel nonlinear component analysis, such as kernel PCA, CCA 
and SVD. Despite the non-convex nature of these problems, our method enjoys theoretical guarantees 
that it converges at the rate 0(l/t) to the global optimum, even for the top k eigen subspace. Unlike 
many alternatives, our algorithm does not require explicit orthogonalization, which is infeasible on big 
datasets. We demonstrate the effectiveness and scalability of our algorithm on large scale synthetic and 
real world datasets. 


1 Introduction 

Scaling up nonlinear component analysis has been challenging due to prohibitive computation and memory 
requirements. Recently, methods such as Randomized Component Analysis m are able to scale to larger 
datasets by leveraging random feature approximation. Such methods approximate the kernel function by 
using explicit random feature mappings, then perform subsequent steps in the primal form, resulting in linear 
computational complexity. Nonetheless, theoretical analysis Ha ns] shows that in order to get high quality 
results, the number of random features should grow linearly with the number of data points. Experimentally, 
one often sees that the statistical performance of the algorithm improves as one increases the number of 
random features. 

Another approach to scale up the kernel component analysis is to use stochastic gradient descent and 
online updates [HE [20]. These stochastic methods have also been extended to the kernel case mmm- 
They require much less computation than their batch counterpart, converge in 0(l/t) rate, and are naturally 
applicable to streaming data setting. Despite that, they share a severe drawback: all data points used in the 
updates need to be saved, rendering them impractical for large datasets. 

In this paper, we propose to use the “doubly stochastic gradients” for nonlinear component analysis. 
This technique is a general framework for scaling up kernel methods [8] for convex problems and has been 
successfully applied to many popular kernel machines such as kernel SVM, kernel ridge regressions, and 
Gaussian process. It uses two types of stochastic approximation simultaneously: random data points instead 
of the whole dataset (as in stochastic update rules), and random features instead of the true kernel functions 
(as in randomized component analysis). These two approximations lead to the following benefits: 
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• Computation efficiency The key computation is the generation of a mini-batch of random features 
and the evaluation of them on a mini-batch of data points, which is very efficient. 

• Memory efficiency Instead of storing training data points, we just keep a small program for regen¬ 
erating the random features, and sample previously used random features according to pre-specified 
random seeds. This leads to huge savings: the memory requirement up to step t is O(f), independent 
of the dimension of the data. 

• Adaptibility Unlike other approaches that can only work with a fixed number of random features 
beforehand, doubly stochastic approach is able to increase the model complexity by using more features 
when new data points arrive, and thus enjoys the advantage of nonparametric methods. 

Although on first look our method appears similar to the approach in [8], the two methods are funda¬ 
mentally different. In [8], they address convex problems , whereas our problem is highly non-convex. The 
convergence result in [5] crucially relies on the properties of convex functions, which do not translate to 
our problem. Instead, our analysis centers around the stochastic update of power iterations, which uses a 
different set of proof techniques. 

In this paper, we make the following contributions. 

• General framework We show that the general framework of doubly stochastic updates can be applied 
in various kernel component analysis tasks, including KPCA, KSVD, KCCA, etc.. 

• Strong theoretical guarantee We prove that the finite time convergence rate of doubly stochastic 
approach is 0(l/t). This is a significant result since 1) the global convergence result is w.r.t. a non- 
convex problem; 2) the guarantee is for update rules without explicit orthogonalization. Previous works 
require explicit orthogonalization, which is impractical for kernel methods on large datasets. 

• Strong empirical performance Our algorithm can scale to datasets with millions of data points. 
Moreover, the algorithm can often find much better solutions thanks to the ability to use many more 
random features. We demonstrate such benefits on both synthetic and real world datasets. 

Since kernel PCA is a typical task, we focus on it in the paper and provide a description of other tasks 
in Section [6] Although we only state the guarantee for kernel PCA, the analysis naturally carries over to 
the other tasks. 


2 Related work 

Many efforts have been devoted to scale up kernel methods. The random feature approach [25, 23] ap¬ 
proximates the kernel function with explicit random feature mappings and solves the problem in primal 
form, thus circumventing the quadratic computational complexity. It has been applied to various kernel 
methods |To. ;8J |lB|, among which most related to our work is Randomized Component Analysis TTiJ. One 
drawback of Randomized Component Analysis is that their theoretical guarantees are only for kernel matrix 
approximation: it does not say anything about how close the solution obtained from randomized PCA is 
to the true solution. In contrast, we provide a finite time convergence rate of how our solution approaches 
the true solution. In addition, even though a moderate size of random features can work well for tens of 
thousands of data points, datasets with tens of millions of data points require many more random features. 
Our online approach allows the number of random features, hence the flexibility of the function class, to 
grow with the number of data points. This makes our method suitable for data streaming setting, which is 
not possible for previous approaches. 

Online algorithms for PCA have a long history. Oja proposed two stochastic update rules for approxi¬ 
mating the first eigenvector and provided convergence proof in [191120] , respectively. These rules have been 
extended to the generalized Hebbian update rules EH [mi] that compute the top k eigenvectors (the sub¬ 
space case). Similar ones have also been derived from the perspective of optimization and stochastic gradient 
descent E3[2]. They are further generalized to the kernel case Haim. However, online kernel PCA needs 
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to store all the training data, which is impractical for large datasets. Our doubly stochastic method avoids 
this problem by using random features and keeping only a small program for regenerating previously used 
random features according to pre-specified seeds. As a result, it can scale up to tens of millions of data 
points. 

For finite time convergence rate, [Jj proved the 0(1/t) rate for the top eigenvector in linear PCA using 
Oja’s rule. For the same task, [2B] proposed a noise reduced PCA with linear convergence rate, where the 
rate is in terms of epochs, i.e., number of passes over the whole dataset. The noisy power method presented 
in jlOj provided linear convergence for a subspace, although it only converges linearly to a constant error 
level. In addition, the updates require explicit orthogonalization, which is impractical for kernel methods. 
In comparison, our method converges in 0(l/t) for a subspace, without the need for orthogonalization. 


3 Preliminaries 

3.1 Kernels and Covariance Operators 

A kernel k(x,y) : X x X H» R is a function that is positive-definite (PD), i.e., for all n > 1, Ci,..., c n £ R, 
and X \,..., x n £ A, we have 


CiCjk(xi,Xj) > 0 . 

i,j= i 

A reproducing kernel Hilbert space (RKHS) J- on X is a Hilbert space of functions from X to R. T 
is an RKHS if and only if there exists a k(x,x') : X x X H > R such that \/x £ X,k(x,-) £ J 7 , and V/ £ 
J 7 , (/(•), k{x, •))jr = /( x). If such a k(x, x') exist, it is unique and it is a PD kernel. A function / £ T if and 
only if \\f\\% ■= (fJ)jr < oo. 

Given a distribution P(x), a kernel function k(x,x') with RKHS J 7 , the covariance operator A : T i—»• T 
is a linear self-adjoint operator defined as 

Af(-):=E x [f(x)k(x,-)], V/ £ J 7 , (1) 

and furthermore ( g , Af)jr = E x [f(x) g{x)], Mg £ T. 

Let F = (/i(-), ^* 2 (■)j..., fk(')) be a list of k functions in the RKHS, and we define matrix-like notation 

AF(-):=(Af i(-),...,A/ fc (.)), (2) 

and F T AF is a k x k matrix, whose th element is (/j, Afj)^. The outer-product of a function v £ F 
defines a linear operator vv T : T i —> T such that 

H T )/(-) := (vJ)jrv(-), V/6 J 7 (3) 

Let V = (ui(-),..., Vk(-)) be a list of k functions, then the weighted sum of a set of linear operators, 
{njul} i=1 ) can be denoted using matrix-like notation as 

k 

VZ k V T -^KvivJ (4) 

1=1 

where is a diagonal matrix with A i on the z-th entry of the diagonal. 

3.2 Kernel PCA 

Kernel PCA aims to identify the top k eigenfunctions V = (ui(-),... ,Vk(-)) for the covariance operator A, 
where V is also called the top k subspace for A. 
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Table 1: Example of kernels and their random feature representation 


Kernel 

k(x, x') 

<M*) 

p(w) 

Gaussian [22] 

exp( I"-*' 11 *) 

exp(— iui T x) 

27T"I exp(-J!^!i) 

Laplacian [22 

exp (— 11 a; — x'||i) 

exp (—iur x) 

Ffd 1 

A li=l 

Cauchy [22] 

Ffd 2 

iii=l l + 

exp(— iui T x) 

exp(- [w||i) 


2 1- " ( y/2v\\x-x’\\2\ V TS ( V 2 v\\x-x'\\ 2 \ 


2 d n d / 2 r(v+d/ 2 )( 2 vy , 2 u , , 2 II, . II 2 \-'+d /2 


r M V t ) v \ *■ ) 


T{v)l 21 ' \i 2 ' 47r IMIU 

Dot Product JT2] 

E”=o > 0 

VoNp^nZMx 

m = n] = J+T 

Polynomial [ST] 

((x,x') + c) p 

FFT~ 1 (©?_ 1 FFT(Cia:)) 

Cj = SjDj, Dj £ R dxd Sj £ R Dxd 

Exp-Semigroup [32 

exp(-/3 J2i =i \Jxi + x j ) 

exp(— lu t x) 

ntr 2^K _i exp(-^) 

Rec-Semigroup [32] 

n d \ 

Bj=l Xi+x ^+A 

exp(— uj t x) 

ntrAexp(-A Wi ) 

Arc-Cosine [7| 


(uj t x) n max(0, lo t x) 

27r 3 exp(— iUb) 


Dj is random {±1} diagonal matrix and the columns of Sj are uniformly selected from {ei, . . . , er>}. v and l are positive parameters. 


K u is a modified Bessel function. © stands for element-wise product. 6 — cos 1 Jn(0) = (— l) 71 (sin 0) 71 "*" 1 ( J in g ^ 


A function v is an eigenfunction of covariance operator A with the corresponding eigenvalue A if 

Av{-)= Av(-). (5) 

Given a set of eigenfunctions {u^} and associated eigenvalues {A,;}, where ( Vi,Vj)jr = Sij. We can denote the 
eigenvalue of A as 

A = VY, k V r + E l £_ l V7 (6) 

where V = (ui(-),... is the top k eigenfunctions of A, and S k is a diagonal matrix with the corre¬ 

sponding eigenvalues, V± is the collection of the rest of the eigenfunctions, and Ej_ is a diagonal matrix with 
the rest of the eigenvalues. 

In the finite data case, the empirical covariance operator is A = ^^2jk(xi,-)k{Xj,-) T or denoted as 
n Si &(®ij •) ® k(xi, •). According to the representer theorem, the solutions of the top k eigenfunctions of A 
can be expressed as linear combinations of the training points with the set of coefficients {cqjy =1 £ R", 

n 

v i = J2 a i k ( x j ,-) 
j=l 


Using Av(-) = Xv(-) and the kernel trick, we have 

KcXi A iCXj, 


where K is the n x n Gram matrix. 

The infinite dimensional problem is thus reduced to a finite dimensional eigenvalue problem. However, 
this dual approach is clearly impractical on large scale datasets due quadratic memory and computational 
costs. 

3.3 Random feature approximation 

The usage of random features to approximate a kernel function is motivated by the following theorem. 

Theorem 1 (Bochner). A continuous, real-valued, symmetric and shift.-invariant function k(x — x') on 
K d is a PD kernel if and only if there is a finite non-negative measure P(w) on such that k(x — x') = 
f Rd e luJ ( x ~ x ) dP(w) = / R d x j 0 2n j <fiw( x )‘/ > ui{y) d (P(w) x P(6)), where P(6) is a uniform distribution on [0, 2n], 
and <t> u (x) = v^cos (uj t x + b). 
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The theorem says that any shift-invariant kernel function k(x,y) = k(x — y), e.g., Gaussian RBF kernel, 
can be considered as an expectation of two feature functions (j) u {x) and </> w (y), where the expectation is taked 
over a distribution on the random frequency u> and phase b. 

We can therefore approximate the kernel function as an empirical average of samples from the distribution. 
In other words, 

k(x,y) « 

i 

where are i.i.d. samples drawn from from P(w) and P(6), respectively. 

The specific random feature functions and distributions have been worked out for many popular kernels. 
For Gaussian RBF kernel, k(x — x') = exp(—||a; — a/|| 2 /2cr 2 ), this yields a Gaussian distribution P(w) with 
density proportional to exp(—cr 2 ||w|| 2 /2); for the Laplace kernel, this yields a Cauchy distribution; and for 
the Martern kernel, this yields the convolutions of the unit ball [26j . Similar representation where the explicit 
form of 4>ui{x) and P(w) are known can also be derived for rotation invariant kernel, k(x, x') = k((x,x')), 
using Fourier transformation on sphere .2Bj. For polynomial kernels, k(x,x' ) = (( x,x') + c) p , a random 
tensor sketching approach can also be used m- See Table [l] for explicit representations of different kernels. 

4 Algorithm 

In this section, we describe an efficient algorithm based on the “doubly stochastic gradients” to scale up 
kernel PCA. KPCA is essentially an eigenvalue problem in a functional space. Traditional approaches convert 
it to the dual form, leading to another eigenvalue problem whose size equals the number of training points, 
which is not scalable. Other approaches solve it in the primal form with stochastic functional gradient 
descent. However, these algorithms need to store all the training points seen so far. They quickly run into 
memory issues when working with hundreds of thousands of data points. 

We propose to tackle the problem with “doubly stochastic gradients”, in which we make two unbiased 
stochastic approximations. One stochasticity comes from sampling data points as in stochastic gradient 
descent. Another source of stochasticity is from random features to approximate the kernel. 

One technical difficulty in designing doubly stochastic KPCA is an explicit orthogonalization step required 
in the update rules, which ensures the top k eigenfunctions are orthogonal. This is infeasible for kernel 
methods on a large dataset since it requires solving an increasingly larger KPCA problem in every iteration. 
To solve this problem, we formulate the orthogonality constraints into Lagrange multipliers which leads 
to an Oja-style update rule. The new update enjoys small per iteration complexity and converges to the 
ground-truth subspace. 

We present the algorithm by first deriving the stochastic functional gradient update without random 
feature approximations, then introducing the doubly stochastic updates. 

4.1 Stochastic functional gradient update 

Kernel PCA can be formulated as the following non-convex optimization problem 

max tr (G t AG) s.t. G T G = /, (7) 

where G := (g 1 ,..., and g l is the i-th function. 

The Lagrangian that incorporates the constraint is 

L(G, A) = tr ( G t AG ) + tr ((G t G - /) A) 

where A is the Lagrangian multiplier. The gradient of the Lagrangian w.r.t G is 

V g L = 2AG + G (A + A T ) . 
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Furthermore, from the optimality conditions 


2 AG + G (A + A t ) = 0, 
G t G -1 = 0, 


we can find A + A T = —2 G T AG. 

Plugging this into the gradient, it suggests the following update rule 

G t+ i =G t +r k (l-G t Gj)AG t . (8) 

Using a stochastic approximation for A: A t f(-) = f{x t ) k{x t , •), we have A t G t = k(x t , -)gj and Gj A t G t = 
gtgj , where g t = [gl(xt), ■ ■ ■ ,g^(xt)] ■ Therefore, the update rule is 

G t+ i = G t (l — r) t g t gj ) + Vtk(x t ,-)gj . (9) 

This rule can also be derived using stochastic gradient and Oja’s rule i l <71 2il . 

4.2 Doubly stochastic update 

The update rule Q has a fundamental computational drawback. At each time step t, a new basis k(xt,-) is 
added to Gt, and it is therefore a linear combination of the feature mappings of all the data points up to t. 
This requires the algorithm to store all the data points it has seen so far, which is impractical for large scale 
datasets. 

To address this issue, we use the random feature approximation k(x,-) ~ <p Ui (x)(j) Ui (-). Denote H t the 
function we get at iteration t, the update rule becomes 

Ht +1 = Ht ( I-gththt 1 ) + gt4>u H {xt)(t>u lt {-)h t T , (10) 

where h t is the evaluation of H t at the current data point: ht. = \h\{xt), ■ ■., ht(xt)\ T ■ 

Given H 0 = Vq, we can explicitly represent H t as a linear combination of all the random feature functions 


H t =J2^(-W +V 0 p, 

i 

where oti G are the coefficients, and /? = J|i<t ^ ~ Vihihi T ^j. 

The update rule on the functions corresponds to the following update for the coefficients 

— gtx/^ujt(.xt)ht 

at = at - rjt.aJh t h t , \/i<t 

The specific updates in terms of the coefficients are summarized in Algorithms 1 and 2. Note that in 
theory new random features are drawn in each iteration, but in practice one can revisit these random features. 


5 Analysis 

In this section, we provide finite time convergence guarantees for our algorithm. As discussed in the previous 
section, explicit orthogonalization is not scalable for the kernel case, therefore we need to provide guarantees 
for the updates without orthogonalization. This challenge is even more prominent when using random 
features, since it introduces additional variance. 

Furthermore, our guarantees are w.r.t. the top fc-dimension subspace. Although the convergence without 
normalization for a top eigenvector has been established before Hanoi, the subspace case is complicated by 
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Algorithm 1: {a*}* = DSGD-KPCA(P(a;), k) 


Require: P(w), (j>^{x). 

1 

for i = 1,... 

, t do 

2 

Sample xt 

~ P(a;). 

3 

Sample uii 

~ P(w) with seed i. 

4 

hi = Evaluate^, G 

5 

Vi4*uji 

(xi)hi. 

6 

a j = a j — 

Vi&Jhihi, for j = 1,..., i — 1. 

7 

end for 



Algorithm 2: h = Evaluate^, 

Require: P(w), (/>^{x). 

1 

Set h = 0 £ K l . 

2 

for i = l,... 

, t do 

3 

Sample w* 

~ P(w) with seed i. 

4 

h = h + <t> Ui (x)a.i. 

5 

end for 



the fact that there are k angles between fc-dimension subspaces, and we need to bound the largest angle. To 
the best of our knowledge, our result is the first finite time convergence result for a subspace without explicit 
orthogonalization. 

Note that even though it appears our algorithm is similar to [8] on the surface, the underlying analysis 
is fundamentally different. In jSj, the result only applies to convex problems where every local optimum is a 
global optimum while the problems we consider are highly non-convex. As a result, many techniques that [8] 
builds upon are not applicable. 

5.1 Notations 

In order to analyze the convergence of our doubly stochastic kernel PCA algorithm, we will need to define 
a few intermediate subspaces. For simplicity of notation, we will assume the mini-batch size for the data 
points is one. 

1. Let F t := (ft, ■■ ■, ft) be the subspace estimated using stochastic gradient and explicit orthogonaliza¬ 


tion: 


F t +i <- F t + rjt A t . Ft 



2. Let G t := (gj, ■ ■ ■ ,gt ) be the subspace estimated using stochastic update rule without orthogonaliza¬ 
tion: 


G t +i <— Gt + rjt (I — GtGj) A t G t . 


where A t Gt and GtGj A t Gt can be equivalently written using the evaluation of the function {gl} on 
the current data point, leading to the equivalent rule : 


Gt+i <— G t (I — gtgtgj ) + Vtk(x t , -)gj . 


( 11 ) 


3. Let G t := (gj ,... ,gj) be the subspace estimated using stochastic update rule without orthogonaliza¬ 
tion, but the evaluation of the function {gl} on the current data point is replaced by the evaluation 
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h t = [hi(x t )] T : 


G t +i 4 — Gt + rjtk(xt, ■)hj — ritGt.hthJ 

4. Let H t := (h\, ..., /i/) be the subspace estimated using doubly stochastic update rule without orthog- 
onalization, i.e.. the update rule: 

H t +1 t- H t + iit<t>ui t (xt)(t , ui t (-)hj - r) t H t h t hJ . (12) 

The relation of these subspaces are summarized in Table [2] Using these notations, we describe a sketch 
of our analysis in the rest of the section, while the complete proofs are provided in the appendix. 

We first consider the subspace G t estimated using the stochastic update rule, since it is simpler and its 
proof can provide the bases for analyzing the subspace H t estimated by the doubly stochastic update rule. 


Table 2: Relation between various subspaces. 


Subspace 

Evaluation 

Orth. 

Data Mini-batch 

RF Mini-batch 

V 

- 

- 

- 

- 

Ft 

ft(x) 

/ 

/ 

X 

G t 

9t{x) 

A 

/ 

X 

G t 

9t(x) 

A 

/ 

X 


h t (x) 

A 

/ 

/ 


5.2 Conditions and Assumptions 

We will focus on the case when a good initialization Vq is given: 

U 0 T V 0 = J, cos 2 9(V, V 0 ) > 1/2. (13) 

In other words, we analyze the later stage of the convergence, which is typical in the literature ( e.g ., [[28]). 
The early stage can be analyzed using established techniques (e.g, 0 ). 

Throughout the paper we suppose \k(x,x')\ < K,\<j> u (x)\ < </> and regard k and as constants. Note 
that this is true for all the kernels and corresponding random features considered. We further regard the 
eigengap Xk — A^+i as a constant, which is also true for typical applications and datasets. 

5.3 Update without random features 

Our guarantee is on the cosine of the principal angle between the computed subspace and the ground truth 
eigen subspace (also called potential function): cos 2 6(V,G t ) = min^ 

Consider the two different update rules, one with explicit orthogonalization and another without 


F t + 1 4- orth(F t + r/ t A t F t ) 

G t +i 4 — Gt + r)t (I — G t Gj ) A t Gt 


where A t is the empirical covariance of a mini-batch. Our final guarantee for Gt is the following. 


Theorem 2. Assume (13) and suppose the mini-batch sizes satisfy that for any 1 < i < t, ||T — A|| < 
(Xk — Afc+i)/8. There exist step sizes ifr = 0(l/i) such that 


1 - cos 2 0(V, G t ) = O(lft). 
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The convergence rate 0{\/t) is in the same order as that of computing only the top eigenvector in 
linear PC A [Jj. The bound requires the mini-batch size is large enough so that the spectral norm of A is 
approximated up to the order of the eigengap. This is because the increase of the potential is in the order of 
the eigengap. Similar terms appear in the analysis of the noisy power method m which, however, requires 
orthogonalization and is not suitable for the kernel case. We do not specify the mini-batch size, but by 
assuming suitable data distributions, it is possible to obtain explicit bounds; see for example ESI 13- 
Proof sketch We first prove the guarantee for the orthogonalized subspace F t which is more convenient 
to analyze, and then show that the updates for F t and Gt are first order equivalent so Gt enjoys the same 
guarantee. To do so, we will require lemma [3] and [4] below 

Lemma 3. 1 — cos 2 9(V, F t ) = 0(l/t). 

Let c 2 denote cos 2 9(V,F t ), then a key step in proving the lemma is to show the following recurrence 

> ct( 1 + 2r/ t (A k - Afc+i - 2 ||A - A t ||)(l - cf)) - 0(?? t 2 ). (14) 


We will need the mini-batch size large enough so that 2 ||A — A t || is smaller than the eigen-gap. 

Another key element in the proof of the theorem is the first order equivalence of the two update rules. 
To show this, we introduce F(G t ) <— orth (G t + r/ t A t G t ) to denote the subspace by applying the update rule 
of F t on G t - We show that the potentials of G t +1 and F{G t ) are close: 

Lemma 4. cos 2 9{V, G t +i) = cos 2 9(V, F(G t )) ± 0(r]f). 


The lemma means that applying the two update rules to the same input will result in two subspaces 
with similar potentials. Then by (14), we have 1 — cos 2 6(V, Gt) = 0(l/t) which leads to our theorem. The 
proof of Lemma[4]is based on the observation that cos 2 0(V,X) = A m i n (K T A'(X T X) _ 1 X T K). Comparing 
the Taylor expansions w.r.t. r] t for X = Gt+ i and X = F(G t ) leads to the lemma. 


5.4 Doubly stochastic update 

The H t computed in the doubly stochastic update is no longer in the RKHS so the principal angle is not 
well defined. Instead, we will compare the evaluation of functions from H t and the true principal subspace 
V respectively on a point x. Formally, we show that for any function v £ V with unit norm |h||jr = 1, there 
exists a function h in Fl t such that for any x, err := |u(a;) — h(x)\ 2 is small with high probability. 

To do so, we need to introduce a companion update rule: G t +i t— Gt + Vtk(xt, -)hj — rjtGththJ resulting 
in function in the RKHS, but the update makes use of function values from h t £ H t which outside the 
RKHS. Let w = G T v be the coefficients of v projected onto G, h = F[ t w, and z = G t w. Then the error can 
be decomposed as 


\v(x) — h[x)\ 2 = \v{x) — z(x) + z(x) — h(x)\ 2 < 2 \v(x) — z(x) | 2 + 2 \z(x) — h(x)\ 2 

< 2 k 2 (h — z\\^r + 2 \z(x) — h(x) | 2 . (15) 

S_ v _✓ V--✓ 

(I: Lemma|6j) (II: Lemma [Tj) 

By definition, ||v — z\\ 2 jr = \\v\\^r — ||^||^- < 1 — cos 2 0(V, Gt), so the first error term can be bounded by the 
guarantee on Gj, which can be obtained by similar arguments in Theorem [ 2 ] For the second term, note that 
Gt is defined in such a way that the difference between z(x) and h(x ) is a martingale, which can be bounded 
by careful analysis. 


Theorem 5. Assume (13) and suppose the mini-batch sizes satisfy that for any 1 < i < t, ||A — A,;|| < 
(Afe — Afc+i)/8 and are of order fl(ln |). There exist step sizes ip = 0(l/i), such that the following holds. If 
f2(l) = A k{GjGi) < Ai (GjGt) = 0(1) for all 1 < i < t, then for any x and any function v in the span of 
V with unit norm ||i'||j F = 1, we have that with probability at least 1 — <5, there exists h in the span of H t 
satisfying |u(a;) — h{x )| 2 = O (| In |) . 
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The point-wise error scales as 0(l/t) with the step t. Besides the condition that ||A — Aj|| is up to the 
order of the eigengap, we additionally need that the random features approximate the kernel function up to 
constant accuracy on all the data points up to time t, which eventually leads to fi(ln |) mini-batch sizes. 
Finally, we need GjGt to be roughly isotropic, i.e., G, is roughly orthonormal. Intuitively, this should be 
true for the following reasons: Gq is orthonormal; the update for Gt is close to that for Gt, which in turn is 
close to F t that are orthonormal. 

Proof sketch In order to bound term I in ( |T5| ), we show that 
Lemma 6.1 — cos 2 8(V, G t ) = O (| In |). 


This is proved by following similar arguments to get the recurrence (14), except with an additional error 
term, which is caused by the fact that the update rule for Gt+i is using the evaluation h t [xt) rather than 
gt(x t ). Bounding this additional term thus relies on bounding the difference between ht(x) — gt(x), which is 
also what we need for bounding term II in (15). For this, we show: 


Lemma 7. For any x and unit vector w, with probability >1 — 5 over (I? 4 ,a/), \gt(x)w — h t (x)w | 2 = 

OQln(f)). 


The key to prove this lemma is that our construction of Gt makes sure that the difference between g t (x)w 
and ht(x)w consists of their difference in each time step. Furthermore, the difference forms a martingale and 
thus can be bounded by Azuma’s inequality. See the supplementary for the details. 


6 Extensions 

The proposed algorithm is a general technique for solving eigenvalue problems in the functional space. 
Numerous machine learning algorithms boil down to this fundamental operation. Therefore, our method can 
be easily extended to solve many related tasks, including latent variable estimation, kernel CCA, spectral 
clustering, etc.. 

We briefly illustrate how to extend to different machine learning algorithms in the following subsections. 

6.1 Locating individual eigenfunctions 

The proposed algorithm finds the subspace spanned by the top k eigenfunctions, but it does not isolate the 
individual eigenfunctions. When we need to locate these individual eigenfunctions, we can use a modified 
version, called Generalized Hebbian Algorithm (GHA) |25| . Its update rule is 

G t+ i =G t + i lt A t G t - rj t Gt UT [G, t A t G t ] , (16) 

where UT [•] is an operator that sets the lower triangular parts to zero. 

To understand the effect of the upper triangular operator, we can see that UT [■] forces the update rule 
for the first function of Gt to be exactly the same as that of one-dimensional subspace; all the contributions 
from the other functions are zeroed out. 

9t+i = + VtA t gl - W gVA t g\ , (17) 

Therefore, the first function will converge to the eigenfunction corresponding to the top eigenvalue. 

For all the other functions, UT [■] implements a Gram-Schmidt-like orthogonalization that subtracts the 
contributions from other eigenfunctions. 

6.2 Latent variable models and kernel SVD 

Latent variable models are probabilistic models that assume unobserved or latent structures in the data. It 
appears in specific forms such as Gaussian Mixture Models (GMM), Hidden Markov Models (HMM) and 
Latent Dirichlet Allocations (LDA), etc.. 
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Algorithm 3: {a*,ft}* = DSGD-KSVD(P(a;),P(y), k) 
Require: P(w), <f> u {x). 

1: for i = 1,..., t do 

2 : Sample Xi ~ P(a;). Sample yt ~ P (y). 

3: Sample u)i ~ P(w) with seed i. 

4: Ui = T£valuate(xi,{aijy-} 1 ) £ R fc . 

5: Uj = Evaluate)?/,, {/3j}j=i) e 

6: W = + VillJ 

7 : ai = r]i<p Ui (xi)vi. 

8 : pi = r)i(j> Ui (yi)ui. 

9: aq = aj — rjiWaj , for j = 1,..., i — 1. 

10: Pj = Pj - rjiWPj, for j = 1,... ,i — 1. 

11: end for 


The EM algorithm [1] is considered the standard approach to solve such models. Recently, spectral 
methods have been proposed to estimate latent variable models with provable guarantees [11129], Compared 
with the EM algorithm , spectral methods are faster to compute and do not suffer from local optima. 

The key algorithm behind spectral methods is the SVD. However, kernel SVD scales quadratically with 
the number of data points. Our algorithm can be straightforwardly extended to solve kernel SVD. The 
extension hinges on the following relation 


0 A T 

' V ' 


' A T U ' 


' V ' 

A 0 

_ U _ 


AY 


U 


where UT,V T is the SVD of A. 

It is therefore reduced to the eigenvalue problem. Plugging it into the update rule and treating the two 
blocks separately, we thus get two simultaneous update rules 


Wt = U?AVt + V t T A T U t 

(18) 

Ut+i = U t + rj t (AV t — U t W t ), 

(19) 

V t+1 = V t + Vt ( A T U t - VtWt) ■ 

(20) 


The algorithm for updating the coefficients is summarized in Algorithm 3. 


6.3 Kernel CCA and generalized eigenvalue problem 

Kernel CCA and ICA [3] can also be solved under the proposed framework because they can be viewed as 
generalized eigenvalue problem. 

Given two variables X and Y, CCA finds two projections such that the correlations between the two 
projected variables are maximized. Given the covariance matrices Cxx, Cyy , and Cxy , CCA is equivalent 
to the following problem 

Cxx C X y 
Cyx Cyy 

where gx and gy are the top canonical correlation functions for variables X and Y, respectively, and a is 
the corresponding canonical correlation. 

This is a generalized eigenvalue problem. It can reformulated as the following non-convex optimization 
problem 

max tr (G t AG) , 
s.t. G t BG = I. 


gx 

9Y 


= (i 


G 


X.Y 


a 


YY 



gx 


9y 
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Algorithm 4: {a*, ft}* = DSGD-KCCA(P(x), P(y), fc) 
Require: P(w), <f> u {x). 

1: for i = 1, ..., t do 

2 : Sample Xi ~ P(a;). Sample yt ~ P(y). 

3: Sample Wi ~ P(w) with seed i. 

4: = Evaluate(ajj, G R fc . 

5: Vi = Evaluate^,, {^ }*.“*) G R k . 

6: W = UivJ + ViuJ 

7: a, = T]i<p Ui (Xi) [Vi - WUi]. 

8: Pi=r)i(l>ui{yi)[Ui-WVi]. 

9: end for 



Figure 1: Convergence for DSGD-KPCA on the dataset with analytical solution. 


Following the derivation for the standard eigenvalue problem, we get the foliowing update rules 

G t +i = G t + r k (/ - BG t Gj) AG t . (23) 

Denote Gf and GjT the canonical correlation functions for X and Y, respectively. We can rewrite the 
above update rule as two simultaneous rules 

Wt = GY T CyxG? + G? T C xy GY (24) 

G; x +1 = Gf + r h [C XY G- CxxGf Wj (25) 

Gr +1 = G] + r h [C yx G? - C YY GjW] . (26) 

We present the detailed updates for coefhcients in Algorithm 4. 

6.4 Kernel sliced inverse regression 

Kernel sliced inverse regression [14] aims to do sufficient dimension reduction in which the found low di¬ 
mension representation preserves the statistical correlation with the targets. It also reduces to a generalized 
eigenvalue problem, and has been shown to find the same subspace as KCCA m- 

7 Experiments 

We demonstrate the effectiveness and scalability of our algorithm on both synthetic and real world datasets. 
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Figure 2: Recovered top 3 eigenfunctions using DSGD-KPCA on the dataset with analytical solution. 

7.1 Synthetic dataset with analytical solution 

We first verify the convergence rate of DSGD-KPCA on a synthetic dataset with analytical solution of 
eigenfunctions [3lj . If the data follow a Gaussian distribution, and we use a Gaussian kernel, then the 
eigenfunctions are given by the Hermite polynomials. 

We generated 1 million data points, and ran DSGD-KPCA with a total of 262,144 random features. In 
each iteration, we use a data mini-batch of size 512, and a random feature mini-batch of size 128. After all 
random features are generated, we revisit and adjust the coefficients of existing random features. The kernel 
bandwidth is set as the true bandwidth of the data. 

The step size is scheduled as 





0 O 

1 + 9 it 


(27) 


where 9 0 and 9i are two parameters. We use a small 9\ « 0.01 such that in early stages the step size is large 
enough to arrive at a good initial solution. 

Convergence Figure [l] shows the convergence rate of the proposed algorithm seeking top k = 3 subspace. 
The potential function is calculated as the squared sine function of the subspace angle between the current 
solution and the ground-truth. We can see the algorithm indeed converges at the rate 0(l/t). 

Eigenfunction Recovery Figure [2] demonstrate the recovered top k = 3 eigenfunctions compared with 
the ground-truth. We can see the found solution coincides with one eigenfunction, and only disagree slightly 
on two others. 


7.2 Nonparametric Latent Variable Model 

In [29| , the authors proposed a multiview nonparametric latent variable model that is solved by kernel SVD 
followed by tensor power iterations. The algorithm can separate latent variables without imposing specific 
parametric assumptions of the conditional probabilities. However, the scalability of the algorithm was limited 
by kernel SVD. 

Here, we demonstrate that with DSGD-KSVD, we can learn latent variable models with one million data, 
achieving higher quality of learned components compared with two other approaches. 

DSGD-KSVD uses a total of 8192 random features, and in each iteration, it uses a feature mini-batch of 
size 256 and a data mini-batch of size 512. 

We compare with 1) random Fourier features with fixed 2048 functions, and 2) random Nystrom features 
with fixed 2048 functions. The Nystrom features are calculated by first uniformly sampling 2048 data points, 
and then evaluate kernel function values on these data points |16j . 
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Table 3: KCCA results on MNIST 8M (top 50 largest correlations) 


# of feat 

Random features 

Nystrom features 

corrs. 

minutes 

corrs. 

minutes 

256 

25.2 

3.2 

30.4 

3.0 

512 

30.7 

7.0 

35.3 

5.1 

1024 

35.3 

13.9 

38.0 

10.1 

2048 

38.8 

54.3 

41.1 

27.0 

4096 

41.5 

186.7 

42.7 

71.0 


DSGD-KCCA 

linear CCA 

corrs. 

minutes 

corrs. 

minutes 

43.5 

183.2 

27.4 

1.1 


The dataset consists of two latent components, one is a Gaussian distribution and the other follows a 
Gamma distribution with shape parameter a = 1.2. One million data point are generated from this mixture 
distribution. 

Figures [3] shows the learned conditional distributions for each component. We can see DSGD-KSVD 
achieves almost perfect recovery, while Fourier and Nystrom random feature methods either confuse high 
density areas or incorrectly estimate the spread of conditional distributions. 

7.3 KCCA MNIST8M 

We then demonstrate the scalability and effectiveness of our algorithm on a large-scale real world dataset. 
MNIST8M consists of 8.1 million hand-written digits and their transformations. Each digit is of size 28 x 28. 
We divide each image into the left and right parts, and learn their correlations using KCCA. Thus the input 
feature dimension is 392. 

The evaluation criteria is the total correlations on the top k = 50 canonical correlation directions calcu¬ 
lated on a separate test set of size 10000. Out of the 8.1 million training data, we randomly choose 10000 as 
an evaluation set. 

We compare with 1) random Fourier and 2) random Nystrom features on both total correlation and 
running time. We vary the number of random features used for both methods. Our algorithm uses a total 
of 20480 features. In each iteration, we use feature mini-batches of size 2048 and data mini-batches of size 
1024, and we run 3000 iterations. The kernel bandwidth is set using the “median” trick and is the same for 
all methods. Due to randomness, all algorithms are run 5 times, and the mean is reported. 

The results are presented in Table [3] We can see Nystrom features generally achieve better results than 
Fourier features. Note that for Fourier features, we are using the version with sin and cos pairs, so the real 
number of parameters is twice the number in the table, as a result the computational time is almost twice 
of that for Nystrom features. 

Our algorithm achieves the best test-set correlations in comparable run time with random Fourier features. 
This is especially significant for random Fourier features, since the run time would increase by almost four 
times if double the number of features were used. We can also see that for large datasets, it is important to 
use more random features for better performance. Actually, the number of random features required should 
grow linearly with the number of data points. Therefore, our algorithm provides a good balance between 
the number of random features used and the number of data points processed. 

7.4 Kernel PCA visualization on molecular space dataset 

MolecularSpace dataset contains 2.3 million molecular motifs J8]. We are interested in visualizing the dataset 
with KPCA. The data are represented by sorted Coulomb matrices of size 75 x 75 [IS]. Each molecule also 
has an attribute called power conversion efficiency (PCE). We use a Gaussian kernel with bandwidth chosen 
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(a) 



(b) 



(c) 

Figure 3: Recovered latent components (a) DSGD-KSVD, (b) 2048 random features, (c) 2048 Nystrom 
features. 
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(a) (b) 

Figure 4: Visualization of the molecular space dataset by the first two principal components. The color 
corresponds to the PCE value: bluer dots represent lower PCE values while redder dots are for higher PCE 
values, (a) Kernel PCA; (b) linear PCA. (Best viewed in color) 



Random feature dimension 


Figure 5: Comparison on KUKA dataset. 


by the “median trick”. We ran kernel PCA with a total of 16384 random features, with a feature mini-batch 
size of 512, and data mini-batch size of 1024. We ran 4000 iterations with step size r) t = 1/(1 + 0.001 * t). 

Figure [4] presents visualization by projecting the data onto the top two principle components. Compared 
with linear PCA, KPCA shrinks the distances between the clusters and brings out the important structures 
in the dataset. We can also see although the PCE values do not necessarily correspond to the clusters, higher 
PCE values tend to lie towards the center of the ring structure. 

7.5 Kernel sliced inverse regression on KUKA dataset 

We evaluate our algorithm under the setting of kernel sliced inverse regression |14l , a way to perform sufficient 
dimension reduction (SDR) for high dimension regression. After performing SDR, we fit a linear regression 
model using the projected input data, and evaluate mean squared error (MSE). The dataset records rhythmic 
motions of a KUKA arm at various speeds, representing realistic settings for robots El- We use a variant 
that contains 2 million data points generated by the SL simulator. The KUKA robot has 7 joints, and the 
high dimension regression problem is to predict the torques from positions, velocities and accelerations of 
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the joints. The input has 21 dimensions while the output is 7 dimensions. Since there are seven independent 
joints, we set the reduced dimension to be seven. We randomly select 20% as test set and out of the remaining 
training set, we randomly choose 5000 as validation set to select step sizes. The total number of random 
features is 10240, with mini-feature batch and mini-data batch both equal to 1024. We run a total of 2000 
iterations using step size r] t = 15/(1 + 0.001 * t). 

Figure[5]shows the regression errors for different methods. The error decreases with more random features, 
and our algorithm achieves lowest MSE by using 10240 random features. Nystrom features do not perform 
as well in this setting probably because the spectrum decreases slowly (there are seven independent joints) 
as Nystrom features are known to work well for fast decreasing spectrum. 


8 Conclusions 

We have proposed a general and scalable approach to solve nonlinear component analysis based on doubly 
stochastic gradients. It is simple, efficient and scalable. In addition, we have theoretical guarantees that the 
whole subspace converges at the rate 0(1/t) to the true subspace. Moreover, since its core is an algorithm 
for eigenvalue problems in the functional space, it can be applied to various other tasks and models. Finally, 
we demonstrate the scalability and effectiveness of our algorithm on both synthetic and real world datasets. 
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Appendix 

The appendix is organized as follows. Section [A] reviews notations, the definition of Kernel PCA and the 
update rules considered. Section |B] provides the sketch of the proof as in the paper. Section [C] provides the 
proof for the stochastic update rule, and Section [D] provides the proof for the doubly stochastic update rule. 

A Setting 

Notations Given a distribution P(x), a kernel function k(x,x') with RKHS F, the covariance operator 
A : T !->■ T is a linear self-adjoint operator defined as 

Af(-):=E x [f(x)k(x,-)\, V/eJ, (28) 

and furthermore 

( 9 ,Af)r=E x [f(x)g(x)], VgeF. 

Let F = (/i(-), /2(-)> • • • > /*:(•)) be a list of k functions in the RKHS, and we define matrix-like notation 

AF(-):= (A fA (29) 

and F T AF is a k x k matrix, whose (i,j)-th element is (fi^Afj)^. The outer-product of a function v € T 
defines a linear operator vv T : T H > T such that 

(vv T )f(-):=(v,f)rv(-), V/e J (30) 

Let V = (ui(-),..., Vk(-)) be a list of k functions, then the weighted sum of a set of linear operators, 
{vivj} i=1 , can be denoted using matrix-like notation as 

k 

KS fc K T := ^ A iVivJ (31) 

2—1 

where T, k is a diagonal matrix with A i on the i-th entry of the diagonal. 

Kernel PCA Kernel PCA aims to identify the top k eigenfunctions V = {v \(•),..., v k (-)) for the covariance 
operator A, where V is also called the top k subspace for A. 

A function v is an eigenfunction of covariance operator A with the corresponding eigenvalue A if 

Av(-) = A'u(-). (32) 

Given a set of eigenfunctions {vi} and associated eigenvalues {A^}, where ( Vi,Vj)jr = Sij. We can denote the 
eigenvalue of A as 

A = VY, k V T + VaXaVZ (33) 

where V = (vi (-),... ,v k (-)) is the top k eigenfunctions of A , and is a diagonal matrix with the corre¬ 
sponding eigenvalues, V± is the collection of the rest of the eigenfunctions, and Xj_ is a diagonal matrix with 
the rest of the eigenvalues. 
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Update rules The stochastic update rule is 


Gt.+i — Gt + r)t (/ — Gt.Gj ) A t Gt (34) 

where Gt := (g \,..., g k ) and g\ is the i-th function. Denote the evaluation of Gt at the current data point 
as 

gt = | \g\{x t ), . ■. ,gt(x t )\ e R k . (35) 

Then the update rule can be re-written as 

G t +i =G t (l — Vtgt.gJ) + mMx t ,-)gj. (36) 

The doubly stochastic update rule is 

H t+ 1 = H t (j-r] t h t h t T ^ + g t Kt( x t)K t {-)ht T , (37) 

where ht is the evaluation of H t at the current data point: 

h t = [h\{x t ),...,h k t {x t )} T (38) 


When larger mini-batch sizes are used, the update rule is adjusted accordingly. For example, when using 
B x t points } and B U}t features ju^ |, the update rule for H t is 


H t +1 <- H t + 


vt E by ( x t)<t> u b '(•) . h t ( x t)_ 


- r}tH t 




B Analysis Roadmap 


In order to analyze the convergence of our doubly stochastic kernel PCA algorithm, we will need to define 
a few intermediate subspaces. For simplicity of notation, we will assume the mini-batch size for the data 
points is one. 

1. Let F t := ( f },..., f k ) be the subspace estimated using stochastic gradient and explicit orthogonaliza- 
tion: 


F t+ i -f - F t + g t A t F t (39) 

Ft+1 <- F t+ 1 (>4iF t+ i) 

2. Let G t := (gl, ■ ■ ■ ,g k ) be the subspace estimated using stochastic update rule without orthogonaliza- 
tion: 

Gt+ i G t + Tjt {I ~ G t Gj ) A t G t - (40) 

where A t Gt and GtGjA t Gt can be equivalently written using the evaluation of the function {gj} on 
the current data point, leading to the equivalent rule: 


Gt+i 4— G t (i — mgtgj) + Vtk(x t , -)gj■ 


(41) 
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3. Let G t := (<?/, ■ ■ ■ ,gt) be the subspace estimated using stochastic update rule without orthogonaliza- 
tion, but the evaluation of the function {gl} on the current data point is replaced by the evaluation 
h t = [hi{x t )] T : 

G t +i <— Gt + r]tk(xt, ■ )hj — gtGt.hthJ (42) 

4. Let H t := (h\ ,..., h /) be the subspace estimated using doubly stochastic update rule without orthog- 
onalization, i.e., the update rule: 

H t + 1 <r- H t + ri t (t> Ut {x t )(t> Ut (-)hJ - r] t Hth t hJ. (43) 

The relation of these subspaces are summarized in Table [4] Using these notations, we describe a sketch 
of our analysis in the rest of the section, while the complete proofs are provided in the following sections. 

We first consider the subspace G t estimated using the stochastic update rule, since it is simpler and its 
proof can provide the bases for analyzing the subspace H t estimated by the doubly stochastic update rule. 

Table 4: Relation between various subspaces. 


Subspace 

Evaluation 

Orth. 

Data Mini-batch 

RF Mini-batch 

V 

- 

- 

- 

- 

F t 

ft{x) 

/ 

/ 

X 

G t 

gt{x) 

X 

/ 

X 

G t 

9t{x) 

X 

/ 

X 


h t (x) 

X 

/ 

/ 


B.l Stochastic update 

Our guarantee is on the cosine of the principal angle between the computed subspace and the ground truth 
eigen subspace V (also called the potential function), which is a standard criterion for measuring the quality 
of the subspace: 

cos 0(V, Gt) = mm L—— 

w ll G HI 

We will focus on the case when a good initialization Vq is given: 


U 0 T V 0 = /, cos 2 0(V, V 0 ) > 1/2. 


(44) 


In other words, we analyze the later stage of the convergence, which is typical in the literature ( e.g ., (2S|). 
The early stage can be analyzed using established techniques (e.g., i). 

We will also focus on the dependence of the potential function on the step t. For this reason, throughout 
the paper we suppose \k(x,x')\ < k, |</ w (a;)| < (f) and regard k and <j) as constants. Note that this is true for 
all the kernels and corresponding random features considered. We further regard the eigengap A& — A^+i as 
a constant, which is also true for typical applications and datasets. Details can be found in the following 
sections. 

Our final guarantee for Gt is stated in the following. 


Theorem [ 2 } Assume (44\ ) an d suppose the mini-batch sizes satisfy that for any 1 < i < t, ||j4 — At\\ < 
(Afe — Afc + i)/8. There exist step sizes ifo = 0(1/*) such that 


1 — cos 2 6(V, G t ) = O(lft). 
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The convergence rate 0(l/t) is in the same order as that when computing only the top eigenvector in 
linear PCA J3], though we are not aware of any other convergence rate for computing the top k eigenfunctions 
in Kernel PCA. The bound requires the mini-batch sizes are large enough so that the spectral norm of A 
is approximated up to the order of the eigengap. This is due to the fact that approximating A with A t 
will result in an error term in the order of ||A — A t \\, while the increase of the potential is in the order of 
the eigengap. Similar terms appear in the analysis of the noisy power method m which, however, requires 
normalization and is not suitable for the kernel case. We do not specify the mini-batch sizes, but by assuming 
suitable data distributions, it is possible to obtain explicit bounds; see for example [301 Ej. 

Proof sketch To prove the theorem, we first prove the guarantee for the normalized subspace F t which is 
more convenient to analyze, and then show that the update rules for F t and G* are first order equivalent so 
that Gt enjoys the same guarantee. 

Lemma [3| 1 — cos 2 9(V, F t ) = 0{l/t). 

Let c 2 denote cos 2 9(V, F t ), then a key step in proving the lemma is to show that 

c 2 t+1 > c 2 ( 1 + 2 Vt (X k - A fc+1 - 2 \\A - A t ||)(l - c 2 )) - 0(r , 2 ). (45) 


Therefore, we will need the mini-batch sizes large enough so that 2 ||A — A t || is smaller than the eigen-gap. 

Another key element in the proof of the theorem is the first order equivalence of the two update rules. 
To show this, we need to compare the subspaces obtained by applying the them on the same subspace Gt- 
So we introduce F(G t ) to denote the subspace by applying the update rule of F t on G t : 


F{G t ) ^G t + Vt A t G t 
F(G t ) <- F{G t ) \F(G t ) T F(G t ) 


1 - 1/2 


We show that the potentials of G t+ i and F(G t ) are close: 
Lemma 0 cos 2 0(V, Gt+i) = cos 2 0(V, F(Gt)) ± 0(% 2 ). 


The lemma means that applying the two update rules to the same input will result in two subspaces with 
similar potentials. Since cos 2 9(V, F(G t )) enjoys the recurrence in (45), we know that cos 2 9(V, G t +i) also 
enjoys such a recurrence, which then results in 1 — cos 2 9(V, G t ) = 0(l/t). 

The proof of the lemma is based on the observation that 


cos 2 9(V,X) = X min {V T X(X T X)~ 1 X T V). 


The lemma follows by plugging in X = Gt.+i or X = F(Gt) and comparing their Taylor expansions w.r.t. 
Vt- 


B.2 Doubly stochastic update 

For doubly stochastic update rule, the computed F[ t is no longer in the RKHS so the principal angle is not 
well defined. Since the eigenfunction v is usually used for evaluating on points x, we will use the following 
point-wise convergence in our analysis. For any function v in the subspace of V with unit norm ||i?||jr = 1, 
we will find a specially chosen function h in the subspace of F[ t such that for any x, 

err := \v(x) — h(x )| 2 
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is small with high probability. More specifically, the w is chosen to be G T v, and let z = Gtw and h = H t w. 
Then the error measure can be decomposed as 

Kz) - h{x )| 2 

= |u(x) — z(x) + z(x) — h(x) | 2 

< 2 |d(x) — z(x) | 2 + 2 |z(x) — h(x )| 2 

< 2k 2 ||n — z\\% + 2 \z{x) — h(x)\ 2 . (46) 

»-v- S -V-' 

(I: Lemma|6j (II: Lemma]?} 

The distance ||u — z\\jr is closely related to the squared sine of the subspace angle between V and G t - 
In fact, by definition, ||u — z\\jr = IMIJr — ||^||jr < 1 — cos 2 6{V, G t ). Therefore, the first error term can be 
bounded by the guarantee on Gt, which can be obtained by similar arguments as for the stochastic update 
case. For the second term, note that G t is defined in such a way that the difference between z{x) = G t (x)w 
and h(x) = H t (x)w is a martingale, which can be bounded by careful analysis. 

Overall, we have the following results. Suppose we use random Fourier features; see Similar bounds 
hold for other random features, where the batch sizes will depend on the concentration bound of the random 
features used. 


Theorem [5} Assume (. 44) and suppose the mini-batch sizes satisfy that for any 1 < i < t, ||A — A,;|| < 
(A*, — Afc+i)/8 and are of order fi(ln |). There exist step sizes rji = 0(l/i), such that the following holds. If 
0(1) = A k{GjGi) < Ai (Gj Gi) = 0(1) for all 1 < i < t, then for any x and any function v in the span ofV 
with unit norm ||f||jr = 1, we have that with probability >1 — 5, there exists h. in the span of H t satisfying 


|u(x) — h(x)\ 2 = O 



The point-wise error scales as 0(l/t) with the step t, which is in similar order as that for the stochastic 
update rule. Again, we require the spectral norm of A to be estimated up to the order of the eigengap, for 
the same reason as before. We additionally need that the random features approximate the kernel function 
up to constant accuracy on all the data points up to time f, since the evaluation of the kernel function on 
these points are used in the update. This eventually leads to f2(ln |) mini-batch sizes. Finally, we need Gj Gi 
to be roughly isotropic, i.e., Gi is roughly orthonormal. Intuitively, this should be true for the following 
reasons: Go is orthonormal; the update for Gt is close to that for Gt, which in turn is close to F t that are 
orthonormal. 


Proof sketch The analysis is carried out by bounding each term in (46) separately. As discussed above, in 
order to bound term I, we need a bound on the squared cosine of the subspace angle between V and Gt- 

Lemma [g] . 1 — cos 2 9(V, G t ) = O In |). 

To prove this lemma, we follow the argument for Theorem [2] and get the recurrence as shown in (451, 
except with an additional error term, which is caused by the fact that the update rule for Gt+i is using the 
evaluation h t (x t ) rather than g t (x t ). Bounding this additional term thus relies on bounding the difference 
between h t (x) — g t {x), which is also what we need for bounding term II in (46). For this purpose, we show 
the following bound: 


Lemma m . For any x and unit vector w, with probability > 1 

0(1 In (£)). 


S over ( V t ,oj t ), \g t (x)w — h t {x)w | 2 = 


The key to prove this lemma is that our construction of Gt makes sure that the difference between 
gt{x)w and ht{x)w consists of their difference in each time step. Furthermore, the difference in each time 
step conditioned on previous history has mean 0. In other words, the difference forms a martingale and 
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thus can be bounded by Azuma’s inequality. The resulting bound depends on the mini-batch sizes, the step 
sizes rji , and the evaluations hi{xi) used in the update rules. We then judiciously choose the parameters and 
simplify it to the bound in the lemma. The complication of the proof is mostly due to the interweaving of 
the parameter values; see the following sections for the details. 

C Stochastic Update 

To prove the convergence of the stochastic update rule, we first prove the convergence of the normalized 
version F t , and then we establish the first-order equivalence of the potential functions of the two update 
rules for F t and G t . Since the final recurrence result does not depend on higher order terms, this first-order 
equivalence establishes the convergence of the stochastic update rule without normalization. 

C.l Stochastic update with normalization 

We consider the potential function 1 — cos 2 9 ( V , F t ) and prove a recurrence for it. We first show this for the 
simpler case where at each step we use the expected operator A in the update rule (Lemma [8]), and then 
show this for the general case where A t can be different from A (Lemma |9]). Then the bound in Lemma [3] 
follows from solving the recurrence in Lemma [9j 

C.1.1 Update rule with expected operator 

The following lemma states the recurrence for the update rule which replace A t in the stochastic update rule 
with the expected operator A = E A t : 


F t +1 <- F t + rj t AF t 


(47) 


- 1/2 


F t +1 *- F t+ 1 (-Fi+i-Ft+i) 

Lemma 8. Let the sequence {Fi}i be obtained from the update rule then 

1 - cos 2 9 (V, F t+ 1 ) < [1 - cos 2 9 (V, F t )] [l - 2 Vt {X k - X k +i) cos 2 9 (V, F t )] + ft, 
where fit = + 3 rfiB 3 and X k and Afc+i are the top k and k + l-th eigenvalues of A. 

Proof. First note that the cosine of subspace angle does not change under linear combination of the basis 


cos 2 9 (V, F t+ i) = min 

w \\F t+ iW ‘'ll 

The update rule gives us 

V T F t+1 w 


/_ - \ — 1/2 
V T F t+1 (F? +1 F t+1 ) w' 


V T F t+1 w 


Ft +1 (Ufi-Ft+i 


- 1/2 


W' 


F t +iw 


(48) 


> ||U T F t wf + 2r] t <U T F t w,V T AF t w) 


(49) 


F t+ iiv < \\F t w\\ z +2r/ t (F t w,AF t w) + B\\F t w\\ 2 


(50) 
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Let w = w/\\F t w\\, u = F t w , so ||u|| = 1. Denote c = ||U T u|| and s = llV^ztll. According to the 


definition, we have c > cos 9 k (V,F t ). Keep expanding the update rule leads to 


V T F t+1 w 


> 


\V T F t w\\ 2 + 2r) t (V T F t w,V T AF t w) 


Ft+iw 


||-ftU)|| 2 + 2rj t ( F t w , AF t w) + B ||F t u>|| 2 rff 


\V T u\\ 2 + 2i lt ( V T u, V T Au) 


(51) 


1 + 2ij t ( u , Au) + Brf 


> | ||K t u|| 2 + 2 ijt (V T u, V T Au)^ {l — 2r] t (u, Au) — Bp 2 } 

> ||U T u||~ + 2 ijt (V t u 7 V t Au) — 2p t ||U T u|p ( u , Au) 

- WtB 2 - 2 rfB 3 

= c 2 + 2^ t {u 1 VV T Au — c 2 u T Au} — fa, 

= c 2 + 2p t u T (VV T - c 2 l) Au - fa 
= c 2 + 2 Vt u r (s 2 VV r - c 2 V±V?) Au - fa. 

Recall that A = VA k V T + V^A k+ - { V[. Then 

u T }s 2 VV T — c 2 V±Vj}) Au = s 2 u T VA k V T u — (^v^V^Ak+iV^u 

> A k s 2 c 2 - A fc+ i c 2 s 2 = s 2 c 2 (X k - Afc+i) 

The recurrence is therefore 

cos 2 9 ( V , F t+ 1 ) >c 2 + 2r) t s 2 c 2 (A fc - X k+1 ) - fa 

= c 2 (1 + 2p t (Afc - Afc+i) (l - c 2 )) - fa,. 

The first term is a quadratic function of c 2 : 

x (1 + a (1 — x )) 


(52) 


(53) 


(54) 


where x := c 2 and a = 2p t (Afc — Afc + i). It has two roots at 0 and 1 + A Therefore, if .) + 2 ' >1, it is a 
monotonic increasing function in the interval of [0,1]. 

Thus, if ?y t < 4 (A t .-A fc+1 ) ’ w hich holds for all t large enough, we have 


cos 2 9 (V, F t+ 1 ) > cos 2 9 ( V ., F t ) (l + 2r/ t (A fc - A fc+ i) (l - cos 2 6 (V, F t ))) - fa 
which leads to the lemma. 

C.1.2 Using different operators in different iterations 


(55) 

□ 


Now consider the case of stochastic update rule (391 where we use a mini-batch to approximate the expec¬ 
tation in each iteration. 


Lemma 9. Let the sequence {J 7 )}, be obtained from the update rule }39} ). then 

1 - cos 2 9 (V, F t+1 ) < [1 - cos 2 9 (V, F t )] [l - 2 Vt (X k - A fc+1 - \\A t - A||) cos 2 9 (V, F t+1 )] + fa , 
where fa = 5pfB 2 + 3 pfB 3 and X k and Afc+i are the top k and k + 1-th eigenvalues of A. 
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( 56 ) 


Proof. The effect of the stochastic update is an additional term in the recurrence 

cos 2 6 ( V ., F t+ 1 ) > c 2 + 2r) t u T ( s 2 VV T - c 2 V ± V?) Au + Z t - /3 t 

where 


Z t = 2 Vt u T (s 2 VV T - c 2 V ± Vl) (A t - A) u. (57) 

The effect of the noise can be bounded, i.e. 

Z* = 2 r lt s 2 u T VV T (A t -A) u- 2r lt c 2 u T V ± VZ (A t - A) u (58) 

= 2r)ts 2 u Y (VV T + hi) (A t - A) u - 2 r] t c 2 u T (Vj _V± + hi) (A t - A) u, 

where s 2 h = c 2 h are positive numbers such that VV T + l\I and V±V2 + hi are positive-definite. 

The generalized Rayleigh quotient leads to the inequality 

|u T (VV T + hi) (At - A) u\ < A u T (VV T + hi) U (59) 

< A (c 2 + h) 

where A is the largest generalized eigen-value that satisfies 

(VV T + IJ) (A t - A) x = A (VV T + hi) x. (60) 

Since VV T + hi is positive definite, we have A = \\A t — Al||. 

Similarly, we have 

K (V ± VZ + hi) (A t — A)u\ < ||At - A|| (s 2 + h) • (61) 

The noise term is thus bounded by 

Z t > —2r) t s 2 ||A t — A|| (c 2 -I- — 2r]tc 2 || A t — A|| ( s 2 + h) ■ (62) 

Note that h and h can be infinitely small positive so we can ignore them. 

Therefore, the recurrence is 

cos 2 9(V,F t+1 ) > c 2 + 2r] t s 2 c 2 (X k - A fc+ i) - 4 rj t || A t - A|| s 2 c 2 - /3 t (63) 

= c 2 (l + 2 ij t (A*, — Afe+i — 2 ||A* — A||) (l — c 2 )) — fit 

which then leads to the lemma. □ 


In order to get fast convergence, we need to take sufficiently large mini-batches such that the variance of 
the noise is small enough compared with the eigen-gap. 


C.2 Stochastic update without normalization 

We show that the cosine angles of the two updates are first-order equivalent. Then, since the recurrence 
is not affected by higher order terms, when the step size is small enough, we can show it also converges in 
0(l/t). 

To show the first order equivalence, we need to compare the subspaces obtained by applying the them 
on the same subspace G t . So we introduce F(Gf) to denote the subspace by applying the update rule of F t 
on Gt- 


F(Gf)^G t + r 1 tA t Gt 
F(G t ) <- F(G t ) \P(Gt) T F(G t ) 


(64) 


1 - 1/2 


Then the first order equivalence as stated in Lemma [4] follows from the following two lemmas for the nor¬ 
malized update rule (39) and the unnormalized update rule (641, respectively. 
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Lemma 10. cos 2 9 ( V ., F(G t )) = X m in (M + 0(rj 2 )) where 

M = V T PP T V + rjV T PP T AV + rjV T APP T V - 2tjV t PP T APP T V, 

where PP T = Gt {Gj Gt) 1 Gj, and P is an orthonormal basis for the subspace Gt. 

Proof. For simplicity, let G denote Gt, and let A denote A t in the following. We first have 

cos 2 9 (V, F(G )) = A min ( V T F(G)F(G) r V ) 

= A min ( F(G) T VV T F(G )) 


= A min \V T (G + rj t AG) (G + rj t AG) T (G + rj t AG) (G + rj t AG) T V}. (67) 


Note that (66) is due to the fact that 


Amin (FfGVW-FiG)) - min 


R- 1 (G + i lt AGY LV t (G + ip AG) Rr 


. z T (G + r/ t AG) T VV T (G + r/ t AG) z 

= mm-=— - 

-z z T R 2 z 

. z T (G + r/ t AG) T VV T (G + rj t AG) z 

= mm-=p- 

2 2 t (G + rj t AG) T (G + rj t AG) z 


|V T {G + rj t AG)z\ 
II (G + rj t AG) z\\ 2 


where R = 


(G + rj t AG ) 1 (G + ijtAG) 


-I 1/2 


Now turn back to (671. Expand the matrix-valued function 


</>(??) = (G + rjAG ) 1 (G + rjAG) 
= (f(0)+<p'(0)rj + O{rj 2 ). 


-1 


(65) 

( 66 ) 


( 68 ) 


So, 


Therefore, 


£'(0) =-2 (G T G) 1 G T n4G (G t G) 


(j)(rj) = (G t G) — 2r7 (G t G) 1 G 1 AG (G t G) ‘+0(r| 2 ). 


I “I 


V 1 (G + jfcAG) [(G + rjtAG) 1 (G + jfci4G)J (G + rj t AG) 1 V 

= (V T G + rj t V T AG) \(G T G)~ 1 -2 tj(G t G)^ G T AG (G T G)^ 1 +0{rj 2 )] (G T V + rj t G T AV) 

= V T G (G t G) _1 G t V + rjV T G (G t G) _1 G 1 AV + rjV T AG (G T G)~ l G t E 
- 2rjV T G (G t G) _ 1 G t AG (G t G) _1 G t V + 0(rj 2 ) 

= V T PP T V + rjV T PP T AV + rjV T APP T V - 2 rjV T PP T APP T V + 0(?7 2 ), 

where PP T = G (G T G) 1 G T , and P is an orthonormal basis for the subspace G. 


(69) 

(70) 

(71) 


a 
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Lemma 11. cos 2 0 ( V, Gt+i) = \ m in (M) where M is as defined in Lemma 
Proof. For simplicity, let G denote G t and let A denote A t . Then cos 2 0 ( V , G t + i) = A m i n (N), where 


N = V T G t+ 1 [Gj +1 G t+1 ] 1 Gj +1 V with G t+1 =G + V (l- GG T ) AG. 
Now it suffices to show N = M. Consider 


4>{rj) = (G + t}(I- GG t ) AG) (G + rj(l- GG T ) AG) 


Then 


0'(O) = - (G T G) 1 [G t (I - GG t ) AG + G t A (/ - GG t ) G] (G t G) 

Therefore, N is 

V T (G + r](l~- GG t ) AG) |~(G + rj (/ - GG T ) AG) T (G + r? (J - GG T ) AG) " 
x (G + rj (/ — GG t ) AG) T V 

= ( V t G + rjV T (/ - GG t ) AG) |~(G + r? (I - GG T ) AG) T (G + t](I- GG t ) AG) 
x ( G t V + rjG T A (I - GG t ) V) 

= (V T G + r]V T (/ — GG t ) AG) 

x [(g t g) _1 -t ? (g t g) _1 [g t ( i-gg t )ag + g t a(i-gg t )g] (G T cy 1 

x ( G t V + rjG T A (I - GG t ) V) 

= F t G (G t G) _1 G t 1/ + rjV T G (G t G) _ 1 G t A (/ - GG T ) V + rjV T (/ - GG T ) AG (G t G) _ 1 G t, F 
—r]V T G (G t G) _1 [G t (/ - GG t ) AG + G T A (/ - GG T ) G] (G t G) _ 1 G t V 
= V T PP T V + r)V T PP T A (/ - GG t ) V + i]V T (I - GG T ) APP T V 
-r]V T PP T (/ - GG t ) APP t V - rjV T PP T A (J - GG T ) PP T I/ 

= V T PP T V + r]V T PP T AV + r]V T APP T V - 2r/V T PP T APP T V 

—r]V T PP T AGG T V - r]V T GG T APP T V + r]V T PP T GG T APP T V + r]V T PP T AGG T PP T V 
= V T PP T V + r]V T PP T AV + r]V T APP T V - 2 r)V T PP T APP T V 


-l 


which completes the proof. 


□ 


D Doubly Stochastic Update 

In this section, we consider the doubly stochastic update rule. Suppose in step t, we use a mini-batch 
consisting of B x t random data points x£(l < r < B x<t ) and P Wjt random features w/(l < s < B u , t )- Then 
the update rule is 


H t +i = H t + rj t E t [(fujtixt)^{-)ht(x t )\ -r) t H t E t [h t (x t ) T h t (x t )] (72) 

= H t (I -r] t E t [h t (x t ) r h t (x t )]) + rj t E t [4> Ut (x t )<t) Ut {-)h t {x t )} (73) 

where for any function f(x,u), E t f(x t ,uf) denotes Uf= i f( x t ; u t)/(Bx,tB Ui t)- As before, we assume 

H 0 = F 0 is a good initialization, i.e., Fg F 0 = I and cos 2 0(Fq, V) > 1/2. Note that H t = [hj (-),..., h/(-)], 
while ht(xt) is its evaluation at Xt, i.e., ht(xt) is a row vector [hj(xt), ■ • •, hf {xt)\. 
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(74) 

(75) 


We introduce the following intermediate function for analysis: 

G t +i = G t + r] t E t [k(x t ,-)ht{x t )] - rjtG t E t [h t (x t ) T h t (x t )] 

= G t {I - r) t E t [h t {x t ) T h t (x t )]) + r) t E t [k(x t ,-)h t {x t )\. 

Again, G 0 = F 0 . 

The analysis follows our intuition: we first bound the difference between H t and Gt by a martingale 
argument, and then bound the difference between Gt and V. For the second step we can apply the previous 
argument. Note that Gt is different from F t since A t F t = k(xt, -)F t (xt) is now replaced by k{xt,-)ht{xt ), so 
we need to adjust our previous analysis. 

Suppose we use random Fourier features for points in see [25]. Then we have 

Theorem [5} Suppose the mini-batch sizes satisfy that for any 1 < i < t, ||A — Ai\\ < (A& — Afc+i)/8 and 
B x i = fi(ln|). There exist step sizes r)i = 0(1/*), such that the following holds. If f2(l) = A k(Gj Gi) < 
Ai (GjGi) = 0(1) for all 1 < i < t, then for any x and any function v in the span of V with unit norm 
H^lljr = 1, we have that with probability >1 — 5, there exists h in the span of H t satisfying 

Kz) ^ h(x)\ 2 = O Q ln 0 • 

Proof. Let w = Gjv, z = Gtw , and h = H t w. 

|u(a:) — h( x)\ 2 = \v(x) — z( x) + z(x) — h.(x)\ 2 

< 2 |u(a:) — z(x) | 2 + 2 \z(x) — h(x) | 2 

< 2 11^ - z\\% || k(x, -)\\% + 2 \z(x) - h(x) | 2 

< 2k 2 ||t; — z\\ 2 jr + | z(x) — h(x) \ 2 . 


Roughly speaking, the difference between v and z is the error due to random data points and can be bounded 


by Lemma 15 whil e th e difference between z{x) and h[x) is the error due to random features and can be 
bounded by Lemma 13 More precisely, since z is the projection of v on the span of G t , 


12 < 1-cos 2 6(G U V) =0 ( jln^ 


13 


where the last step is by Lemma [ljj Also, since ||w|| < 1, we have \z{x) — h(x) | 2 = O (| In |) by Lemma 
What is left is to check the mini-batch sizes; see the assumptions in Lemma [T2| and Lemma fl5| We neec 
Afe(Ej [hi(xi) T hi(xi )]) = Afc(Ea, [hi(x) r hi(x)] ) ± 0(1), so we only need to estimate E^ hi {x) T h[{x) up to 

constant accuracy for all 1 < j, £ < k, for which B Xj i = 0(ln |) suffices. We also need A u = 0(Afc — Afc+i) = 
0(1), so we only need = 0(1). This is a bound for {tB x ji ) 2 pairs of points, for which B Uii = 0(ln |) 


suffices. 


□ 


Similar bounds hold for other random features, where the batch sizes will depend on the concentration 
bound of the random features used. 

The rest of this section is the proof of the theorem. For simplicity, ll-IU is shorten as ||-||. 

First, we bound the difference between H t and G t . 


Lemma 12. 

k(xi,xj) - J] 
probability > 1 


Suppose \k{x,x')\ < K,\(f) u (x)\ < </>. Suppose the mini-batch sizes are large enough so that 


s — 1 f^uj s {Xi)(f>uj s {Xj^}l^duj^i 
— 8 over (V 1 , w 4 ), 


< for all sampled data points Xi and Xj. For any w and x, with 


\g t+1 {x)w 


h t+ i(x)w\ 2 < B 2 +1 := -A 2 In 


t 

y |Ej \hi(xi)\ a t ,iW 
2=1 
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where at. 


k 


,1 = Vi YlU+1 i 1 - [ h j( x j) r hj( x i)]) f or 1 - * - and I hi( x i)\ 

Proof. Note that 


H( x i) 


t 


-^cfc+1 ^ [4*uji (’)^(*^)] ^ t,i H - -^0^t,Cb 

i= 1 

(76) 

t 

(77) 


i=l 


where a tj0 = n*=i i 1 ~ Vj K j [ h j( x j) T hj( x j)]). 

We have g t+1 (x)w - h t+1 (x)w = J2l=i v t,i( x ) where 

V t ,i{x) = Ej [, k(xi,x)hi(xi) - 4> aJi (xi)(j) U]i (x)hi(xi)} a t ,iW. 

V tj i{x ) is a function of (V\uj 1 ) and 

[V r t, i (x)|u;' _1 ] = E V i tU i-iE Ui [V t ,i(x) |w ,_1 ] =0, 

so {Vj^x)} is a martingale difference sequence. 

Since |Vt j i(x)| < A W |E,; |/ij(x,)| Ot,iw|, the lemma follows from Azuma’s Inequality. □ 

So to bound \g t (x)w — h t {x)w |, we need to bound |Ej \hi{xf)\ at,%w\, which requires some additional 
assumptions. 

Lemma 13 (Complete version of Lemma [7|. Suppose the conditions in Lemma [7D| are true. Further sup¬ 
pose for all i < t, r]i = 9/i where 9 is sufficiently large so that 0Afc(Ej [/ij(xj) T /ij(xj)]) > 1/ also suppose 

Ar (gTG 4 ) =0(1). 

Wif/i probability > 1 — d over for all 1 < i <t and £ £ [k], we have 

|fff(x i )-/ i |(x i )| 2 = o(^ln0)). 

f£) For any x and unit vector w, with probability > 1 — S over 

\g t (x)w - h t (x)w\ 2 = O In Qjj . 

Proof. We first do induction on statement (1), which is true initially. Assume it is true for t, we prove it for 
t + 1. 

We have that for any unit vector w, 


|Ej \hi(xi)\a t}i w\ = 


Vi E,: \hi(xi)\ n [I- Vj^j [hj{xj) r hjixj)]] w 


j—i+l 


< r)i ||Ej \hi(xi)\\\ |MI \\l-VjEj [h j (x j ) T hj(xj)\ \ 


j=i +1 


ri ('-))=°(t)- 


j=i +1 


31 










We use in the second line 


0 2 , t 


hi(xi)\\ < O [ \J — In - ] + ||ffi(ari)|| < O ( \/ — In - ) + 


6» 2 , t 


t 5 


GJ G, 


ll^i)ll = 0(9) 


that holds with probability 1 — t6/(t + l) by induction, and we use in the last line 9Xf : (E i [hi(xi) T hi(xi )\) > 1. 
Then by Lemma 12 with probability > 1 — S/(k(t + 1)), 

\g t +i(x t+ i)w - h t+ i(x t+ i)w \ 2 < ^A 2 In — j ^ |E* \hi(xi)\ a M w| 2 

' ' i= 1 


<0(A 2 )ln 


t + l\ + 1 


H t 2 ° 


\ t ~\~ 1 


Repeating the argument for k basis vectors w = e,(l < i < k) completes the proof. 
The other statement follows from similar arguments. 

Next, we bound the difference between Gt and V. 


□ 


Lemma 14. Suppose the conditions in Lemma 13 are true and furthermore, A k(Gj Gf) = 12(1) for alii £ [f]. 
Let c 2 denote cos 2 9(G t , V). Then with probability >1 — 5, 


c t+i >Ct <l + 2r/ t 


Afc — Afc+r — 2 11 .A* — A\\ — O \ ^9\j-\n- 


1, t 


(l - c 2 ) - O ryA^ 


1 — c? t 


ln T -A 


where /3t is as defined in Lemma [S| 


Proof. The potential of Gt can be computed by a similar argument as in the previous section; the only 
difference is replacing A t u with k(xt,-)h t (xt)w. This leads to 

cos 2 9(G t +i, V) >c 2 + 2?y t u T (s 2 VV T - c 2 V±V±) k(x t , ■)h t (x t )w - /3 t 

= c 2 + 2r/ t u T (s 2 VV T - c 2 V±Vj) [(k(x t , -)h t (x t )w - A t u) + (A t u - Au) + Au] - /3 t (78) 

where u = Gtw with unit norm ||u|| = 1. 

The terms involving (A t u — Au) and Au can be dealt with as before, so we only need to bound the extra 
term 

u T (s 2 VV T - c 2 V±Vj) [k(x t , ■)ht(x t )w — A t u} 

= u T (s 2 VV T - c 2 V±Vj) [k(x t ,-)h t (x t )w - k(x t ,-)g t (x t )w] 

= u T (s 2 VV T - c 2 V ± V±) k(x t , ■)[h t (x t ) - g t ( x t )]ui. 

So we need to bound [h t (x t ) — gt(x t )]w, which in turn relies on Lemma 13 More precisely, we have 
\\h t (x t ) — fft^Olloo — O (^A u 6 2 yj 1/ t^j with probability >1 — 5. Also, we have u = G t w has unit norm, so 
||w|| = 0(1) when \ k (GjGf) = 11(1). Then 

\u T VV T k(x t , -)[ht(xt) - g t (x t )]w\ < ||u T P|| ||A:(x t , -)|| O (a u 9 2 y/ljtj < c 2 d (a^O 2 y/TJt^ 
where the last step follows from c > 1/2 by assumption. Similarly, 


■ T V±Vlk(x t , •)[ h t (x t ) - g t (x t )\w I < llu 1 


Vl|| ||fc(x t ,-)||o(A w 0 2 x /lA) < sO (A = d\A ui e 2 ^—^j . 


Plugging into (781 and apply a similar argument as in Lemma [8] and [9] we have the lemma. 


a 
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Lemma 15 (Complete version of Lemma[6|. If the mini-batch sizes are large enough so that ||^4 — Ai\\ < 
(A fe - A fc+ i)/8, A fc (Ei [hi(xi) r hi(xi)]) = X k (E x [hi(x) T hi(x)\) ± 0(1), and = 0(X k - A fc +i), then 


( 1 ) 9 = 0 ( 1 ); 

(2) l-c? = 0(iln|). 


Proof. If the mini-batch size is large enough so that Afc(Ej [hi(x-i) T hi(xi )\) = 
we only need to show Aj^E^ [hi(x) T hi(x )]) = 11(1), which will lead to 6 

leads to 1 — cos 2 6(G t +i, V) = 0(l/t). 


recurrence in Lemma 


14 


Let ei {x) = hi ( x) — gi (x). Then 


X k (E x [h i {x) T h i (x)])±0{l), 
= 0(1) and then solving the 


E* [hi(x) T hi(x)\ = E X [ 5 , ; (a;) T gi(a;)] + 2E K [e^(a:) T (cc)] - E x [ei(x) T ei{x)\ . 


By Lemma 13 E^ ej(x ) = O(0 4 /t), which is o(l) if 9 = 0(1). Then the norm of 2 E.e [ei(x) T hi(x)\ — 

E a , [e*(x) T e,(a;)] is o(l), so we only need to consider E x [gi(x) T gi(x)]. 

Formally, we prove our statements (1)(2) by induction. They are true initially. Suppose they are true 
for t — 1, we prove them for t. 

First, by solving the recurrence for ct, we have that statement (2) is true up to step t. 

Next, since E x [g t (x) T g t {x)] = Gj AG t , we have 


w t E x [g t (x) T g t (x)] w =w T GjAG t w 

=w T Gj ( VA k V T + V ± A ± V? )G t w 
>w T GjVA k V T G t w 

>x k <? t IMI 2 


which means Afe(E x [g t fx) T gt{x )]) = 11(1) by induction on Ct and by the assumption that X k (GjGt ) = 11(1). 
This then leads to A^(Ej [hifxi) T hi(xi )])) = 11(1), which means 6 = 0(1) up to step t. □ 
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